Tamil Search Engine
Abstract
The web creates new challenges for information retrieval. The amount of information on the web, as well as the number of new users, is growing rapidly. Search engine technology has to scale dramatically to keep up with the growth of the web. In this paper, we present the details of constructing and maintaining a Tamil Search Engine. We discuss the crawler, the database storage architecture, the searcher, and the functional modules of the search engine.

1. Introduction

With the phenomenal growth of the web, more and more information is available online, from personal data to scientific reports to up-to-the-minute satellite images. Human-maintained lists cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics. Hence, searching the web for a particular piece of information is not only time consuming but also daunting. Thus comes the need for a search engine.

A search engine is a software package that collects web pages on the internet through a robot (spider, crawler) program and stores the information on the pages with appropriate indexing to facilitate quick retrieval of desired information. The crawler continuously looks for updated information on the web and stores it in a database. An efficient indexing mechanism is normally used to retrieve the information quickly. When a user queries the search engine for a particular topic, the search engine looks up the database and lists the pages containing information on that topic. In displaying these pages, some mechanism for ranking the relevant pages is used. Ranking can be based on parameters such as the number of occurrences of the word (location/frequency) and the number of links coming into and out of the web pages [1].

This paper deals with the design of a Tamil search engine that searches for Tamil documents on the web.
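The lookup and ranking step described above can be illustrated with a minimal sketch. It assumes ranking by occurrence count only (link counts are left out), and the names `build_index`, `search`, and the example URLs are hypothetical, not taken from the paper.

```python
from collections import Counter, defaultdict


def build_index(pages):
    """Map each word to a Counter of {url: number of occurrences}.

    `pages` is a dict of {url: page text}; splitting on whitespace is a
    simplification of real tokenization.
    """
    index = defaultdict(Counter)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] += 1
    return index


def search(index, word):
    """Return the URLs containing `word`, most occurrences first.

    Ties are broken alphabetically by URL so the order is deterministic.
    """
    hits = index.get(word.lower(), {})
    return sorted(hits, key=lambda url: (-hits[url], url))


# Hypothetical pages used only to exercise the sketch.
pages = {
    "http://example.org/a": "tamil search engine for tamil pages",
    "http://example.org/b": "a search engine",
    "http://example.org/c": "tamil news",
}
index = build_index(pages)
results = search(index, "tamil")
```

Here `results` lists page `a` before page `c` because the query word occurs twice in `a` and once in `c`. A fuller ranking function would presumably combine this count with the link-based parameters mentioned above.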
Like any other search engine, it consists of a crawler, an indexing mechanism and a database to store the information. However, the challenges in handling Tamil documents and the Tamil language are different. There are innumerable Tamil sites on the net, and the formats in which their content is stored and represented differ. For instance, different sites use different fonts, each font having its own encoding. In spite of the standardization enforced by the Tamilnadu Government (these standards are detailed on the Tamilnet99 web page [2]), which requires everybody to use TAM or TAB fonts, sites hosted earlier use different encodings. Thus, effective retrieval of content requires collecting information in all the fonts and converting it to a standard format before storing it.

Another challenge concerns the language itself. Tamil being a highly inflectional language, every root word takes on innumerable forms through the addition of suffixes indicating person, gender, tense, number, case, etc. Thus the number of words to be handled is large, which can be an issue in the design of the database and has to be handled effectively.

The following sections describe the design of the Tamil search engine that addresses these problems to provide an efficient search mechanism for Tamil documents.

2. Design of Tamil Search Engine

The Tamil Search Engine consists of the following major components:

• Crawler
• The Database System
• Searcher

The searcher component includes the ranking mechanism and the user interface. The design of each of the modules is described below.

2.1 Crawler Module

The crawler module is responsible for retrieving pages from the web and handing them to the database management module.
It takes a list of seed URLs as input and sequentially traces through all the links in the list. Each document is downloaded, and the links contained in it are extracted. The extracted links are added to the URL list if they are not already present. This process is repeated periodically to obtain the updated status of the web documents.

The documents are then processed to extract word information. The documents could be multilingual; however, only English and Tamil content is handled by the search engine. The language is first identified and the content processed accordingly. Tamil words and English words are maintained separately.

Effective retrieval of content requires collecting information in all the fonts and converting it to a standard format before processing. Words in different fonts are converted to the TAB font using generic font converters. These font converters take care of most of the popular fonts (such as TAM kumudam, Vikatan, Kalki, TMNews, LT-TM-Barani, LT-TM-Lakshman, LT-TM-Kurinji, Amudham, Elango-TML-Panchali-Normal, Tboomi, TM-TTKapilan, TM-TTValluvan, Sarukesi, Tamilweb, Aabohi, Anantha_shanmugathas, Bamini/Baamini, DenukaPC, Eelanadu, Inaimathi, Inaikathir, Arulmathi, Webtamil, Anjal-System, Anjal-Text, Tamilnet).

The documents may be in HTML, doc, PDF or PS formats. HTML documents are processed as follows. The documents are parsed, and words from the body part of the HTML file are retrieved. Then the stop words (whose content value is very low) are removed. For the remaining words, the corresponding root words are extracted using the morphological analyzer. Each root word is stored in a word BTree and then in the database.

Extracting root words and storing only root words in the database is an important design decision. Since Tamil is an inflectional language, each root word combines with numerous suffixes and appears in many different forms.
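The word-processing steps above (stop-word removal, root extraction, storage by root word) can be sketched as follows. This is an illustration under several assumptions: the text is already converted to Unicode strings rather than the TAB font, a plain dictionary stands in for the word BTree, and `root_of` is a toy suffix stripper standing in for the morphological analyzer. The stop-word and suffix lists are hypothetical examples, not the ones used in the paper.

```python
from collections import defaultdict

# Hypothetical stop words ("a/one", "and"), for illustration only.
STOP_WORDS = {"ஒரு", "மற்றும்"}

# Hypothetical suffix list, longest first: locative plural, then plural.
SUFFIXES = ("களில்", "கள்")


def root_of(word):
    """Toy stand-in for the morphological analyzer: strip one known suffix."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def index_document(index, url, text):
    """Record `url` under the root of every non-stop word in `text`."""
    for word in text.split():
        if word in STOP_WORDS:
            continue
        index[root_of(word)].add(url)


# A dict of sets replaces the word BTree in this sketch.
index = defaultdict(set)
index_document(index, "doc1", "ஒரு வீடு")
index_document(index, "doc2", "வீடுகள் மற்றும் நாடுகளில்")
```

Both documents end up under the single key `வீடு`, although `doc2` contains only the plural form. A real analyzer would also have to handle stem changes, which simple suffix stripping cannot.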
Storing every inflected form of a root word would be too costly and is also unnecessary. Often, a user specifies one form of a word but expects all information regarding the root word; that is, documents containing other forms of the word are normally desired as well. Thus, identifying the root words and indexing the documents by root word makes it easy to retrieve all forms of the root word.

Tamil Internet 2003, Chennai, Tamilnadu, India

Once the text part of the HTML content is processed, the URL links in the page are identified from the HTML tags that carry them. The URL links are then inserted into the URL list, which is maintained in a URL BTree.

For PDF, doc and PS files, the first step in processing is conversion to text files. PDF and PS files are converted using existing utilities. Once the text file is obtained, the same process of removing stop words and identifying root words using the morphological analyzer is carried out.

For all types of files, in addition to storing the root words in the database, the entire file is also stored locally as a text file. This is required to present the content in which the desired word appears in the document.

Figure 1 depicts the overall flow of the crawler.
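The crawl loop described in this section can be sketched roughly as below. It is not the authors' implementation: a set and a queue stand in for the URL BTree, only `href` attributes of `a` tags are followed (an assumption about which tags carry links), and page download is passed in as a hypothetical `fetch` function so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href values of <a> tags (assumed link-carrying tag)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Visit pages breadth-first from the seeds and return {url: html}.

    `fetch(url)` returns the page HTML, or None if the page is unavailable.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)  # stands in for the URL BTree
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:  # add only if not already present
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical three-page site used in place of real downloads.
FAKE_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/">home</a> <a href="/b.html">B</a>',
    "http://example.org/b.html": "<p>no links</p>",
}
pages = crawl(["http://example.org/"], FAKE_WEB.get)
```

Each page is fetched once even though `b.html` is linked from two pages, which mirrors the rule that extracted links are added only if not already in the URL list. Periodic re-crawling and the conversion of PDF, doc and PS files are left out of the sketch.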